8 research outputs found

    Self-Supervised Audio-Visual Co-Segmentation

    Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation. Comment: Accepted to ICASSP 2019.
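    For intuition, the sketch below illustrates one way a channel-to-category assignment of the kind described above could be read out: each feature channel is assigned to the category whose examples activate it most, and a coarse segmentation map is obtained from the channels assigned to a category. The function and variable names are illustrative and are not the authors' implementation.

```python
import torch

def assign_channels_to_categories(activations, labels, num_categories):
    # activations: (N, C, H, W) feature maps for N images
    # labels: (N,) dominant category id per image (assumed available for analysis)
    mean_act = activations.mean(dim=(2, 3))                      # (N, C) mean activation per channel
    per_cat = torch.stack([mean_act[labels == k].mean(dim=0)
                           for k in range(num_categories)])      # (K, C) mean activation per category
    return per_cat.argmax(dim=0)                                 # (C,) category id assigned to each channel

def segment_category(feature_map, channel_to_cat, category):
    # feature_map: (C, H, W) features of one image; keep only channels assigned to `category`
    selected = feature_map[channel_to_cat == category]
    return selected.sum(dim=0)                                   # coarse (H, W) heat map for that category
```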

    UAVM: Towards Unifying Audio and Visual Models

    Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. We also find several intriguing properties of UAVM that its modality-independent counterparts do not have. Comment: Published in Signal Processing Letters. Code at https://github.com/YuanGongND/uavm
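    As a rough illustration of the unified-branch idea (not the released UAVM code), the sketch below shares a single transformer backbone across audio and visual inputs, with only light modality-specific input projections; the layer sizes and the 309-class VGGSound-style output head are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedAVClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=768, d_model=768, num_classes=309):
        super().__init__()
        # light modality-specific projections into a common token space
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # one shared transformer backbone used for both modalities
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens, modality):
        # tokens: (B, T, audio_dim) or (B, T, video_dim) depending on `modality`
        proj = self.audio_proj if modality == "audio" else self.video_proj
        x = self.shared_encoder(proj(tokens))
        return self.head(x.mean(dim=1))                          # mean-pool tokens, then classify
```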

    Contrastive Audio-Visual Masked Autoencoder

    In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae. Comment: Accepted at ICLR 2023 as a notable top 25% paper.
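    A hedged sketch of how the two named objectives, contrastive audio-visual matching and masked reconstruction, can be combined into one training loss is given below; the encoders, decoder outputs, and the weighting `lam` are placeholders rather than the released CAV-MAE implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    # audio_emb, video_emb: (B, D) pooled embeddings of paired audio/visual clips
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                             # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)           # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def joint_cav_loss(audio_emb, video_emb, recon, target, mask, lam=0.01):
    # recon/target: (B, N, P) reconstructed vs. original patches; mask: (B, N), 1 = masked patch
    per_patch_mse = ((recon - target) ** 2).mean(dim=-1)
    recon_loss = (per_patch_mse * mask).sum() / mask.sum().clamp(min=1)
    return recon_loss + lam * contrastive_loss(audio_emb, video_emb)
```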

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions of teacher models that use input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers.
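    The sketch below illustrates the kind of cross-entropy distillation objective described above: the student's distribution over text-video similarity scores (non-English text) is pushed toward the teacher's distribution (English text). All names and the temperature are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_text, teacher_text, video, tau=1.0):
    # student_text: (B, D) embeddings of non-English captions (student)
    # teacher_text: (B, D) embeddings of English captions (teacher, kept fixed)
    # video:        (B, D) video embeddings; row i of all three is the same clip
    s_sim = student_text @ video.t() / tau                       # (B, B) student similarity scores
    t_sim = teacher_text @ video.t() / tau                       # (B, B) teacher similarity scores
    teacher_dist = F.softmax(t_sim, dim=-1).detach()             # soft targets from the teacher
    # cross entropy between the teacher's and the student's score distributions
    return -(teacher_dist * F.log_softmax(s_sim, dim=-1)).sum(dim=-1).mean()
```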

    Learning Audio-Video Language Representations

    Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to the reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly from audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction for enabling novel speech processing tools. (M.Eng. thesis)
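    As a small illustration of the retrieval use case described above, the sketch below ranks candidate video clips against an audio query by cosine similarity in a shared embedding space; it assumes the embeddings have already been produced by some audio-video model and is not code from the thesis.

```python
import torch
import torch.nn.functional as F

def retrieve_clips(query_audio_emb, clip_embs, top_k=5):
    # query_audio_emb: (D,) embedding of the spoken/audio query
    # clip_embs:       (N, D) embeddings of the candidate video clips
    q = F.normalize(query_audio_emb, dim=-1)
    c = F.normalize(clip_embs, dim=-1)
    scores = c @ q                                               # cosine similarity per clip
    return torch.topk(scores, k=min(top_k, c.size(0))).indices   # indices of the best-matching clips
```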

    Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

    Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. Lastly, we show baseline results on image retrieval and audio retrieval tasks. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in those datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
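    For context, retrieval baselines of this kind are commonly scored with recall@k, i.e. whether the ground-truth pair appears among the top-k retrieved items; the sketch below shows that standard metric and is not code from the Spoken ObjectNet paper.

```python
import torch

def recall_at_k(audio_embs, image_embs, k=10):
    # audio_embs, image_embs: (N, D); row i of each is a ground-truth pair
    sims = audio_embs @ image_embs.t()                           # (N, N) similarity matrix
    topk = sims.topk(min(k, sims.size(1)), dim=1).indices        # top-k image indices per audio query
    targets = torch.arange(audio_embs.size(0)).unsqueeze(1)      # ground-truth index per query
    return (topk == targets).any(dim=1).float().mean().item()
```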

    The Sound of Pixels

    We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and to separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources. Keywords: Cross-modal learning; Sound separation and localization. National Science Foundation (U.S.) (Grant IIS-1524817)
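    A hedged sketch of the Mix-and-Separate training idea described above: the audio of two clips is mixed, and the network is trained to recover each original spectrogram conditioned on that clip's visual features. The module names (video_net, audio_unet) are placeholders, not the PixelPlayer implementation.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_step(video_net, audio_unet, frames1, frames2, spec1, spec2):
    # frames*: video frames of two clips; spec*: their magnitude spectrograms (B, F, T)
    mix = spec1 + spec2                                          # synthetic mixture as free supervision
    v1, v2 = video_net(frames1), video_net(frames2)              # visual feature per clip
    # predict a separation mask for each clip, conditioned on its visual feature
    mask1 = torch.sigmoid(audio_unet(mix, v1))                   # (B, F, T)
    mask2 = torch.sigmoid(audio_unet(mix, v2))
    est1, est2 = mask1 * mix, mask2 * mix                        # masked estimates of each source
    # the targets are free: the unmixed spectrograms we started from
    return F.l1_loss(est1, spec1) + F.l1_loss(est2, spec2)
```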